Here, we provide some background information useful for non-technical audiences and first-time users of Census Data.
The US Census Bureau provides current data on America’s people, places, and economy. Its mission is to serve as the nation’s leading provider of quality data about its people and economy. The Census Bureau works to honor privacy, protect confidentiality, share its expertise globally, and conduct its work openly. 1
This documentation will focus on the American Community Survey (ACS), an ongoing annual survey that shows what the U.S. population looks like and how it lives.2 The ACS is a tool that can help communities decide where to target services and resources by helping them understand the changes happening in their communities. It is an important source of detailed population and housing information.
The ACS releases new data each year in the form of estimates. These estimates are presented in a variety of tables, tools, and analytical reports. The ACS data can be accessed through the Census Bureau’s platform, data.census.gov. The ACS has 1-year 3 and 5-year estimates of the data 4. It is important to consider currency and sample size/reliability/precision in choosing which dataset to use. The table “Distinguishing features of ACS 1-year, 1-year supplemental, 3-year, and 5-year estimates” 5 summarizes details, research implications, and examples to help guide data users.
For the purposes of this documentation, we will focus on accessing ACS 5-year estimates.
For the scope of this project, we will be focusing on the following variables.
The DP05 table in the ACS provides detailed information on the race and ethnicity of the population. The data profile (DP) tables provide many popular statistics across subjects (in this case, demographic and housing characteristics) in a single table. We are particularly interested in counts and percentages of various racial and ethnic groups in a given area. The US Census defines race here as a subset based on responses to the race or Hispanic/Latino-origin questions.
These provide the following possible categories:
| Table Codes | Race/Ethnicity Label |
|---|---|
| A | White alone |
| B | Black or African American Alone |
| C | American Indian and Alaska Native Alone |
| D | Asian Alone |
| E | Native Hawaiian and Other Pacific Islander Alone |
| F | Some Other Race Alone |
| G | Two or More Races |
| H | White Alone, Not Hispanic or Latino |
| I | Hispanic or Latino |
Out of these nine groups, we consistently focus on the following six across all variables when stratifying by race/ethnicity. This helps prevent overlap in our race/ethnicity categories:
B. Black or African American Alone
C. American Indian and Alaska Native Alone
D. Asian Alone
E. Native Hawaiian and Other Pacific Islander Alone
H. White Alone, Not Hispanic or Latino
I. Hispanic or Latino
The B01001 table provides sex by age: the estimated counts of men and women in age buckets, mostly of five years (e.g. 25 to 29 years, 30 to 34 years). Certain age groups, including the later teens and the 60s, have brackets of differing sizes; there are also columns for just 20-year-olds and for just 21-year-olds.
The letters A through I provide versions of this table by race and ethnicity; these letter codes follow the same race/ethnicity codes discussed earlier. For example, B01001A provides sex by age for White alone and B01001B provides sex by age for Black or African American alone. Summing the under-18 age groups across both sexes in each of these tables gives the number of children for each race and ethnicity. This can be useful for understanding the demographic composition of children within a community.
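As a rough sketch of the aggregation described above, the snippet below sums the under-18 age groups across both sexes for one race-specific table. The data frame `toy_b01001a` and its estimates are invented purely for illustration, the helper `count_children` is hypothetical, and the variable suffixes (003 to 006 for males and 018 to 021 for females under 18) are assumptions that should be confirmed against the variable metadata for your year.

```r
# Hypothetical, made-up estimates shaped like a tidycensus pull of B01001A
toy_b01001a <- data.frame(
  variable = sprintf("B01001A_%03d", c(3:7, 18:22)),
  estimate = c(100, 120, 110, 60, 40, 95, 115, 105, 55, 45)
)

# Assumed suffixes for the under-18 age groups (male 003-006, female 018-021)
child_vars <- sprintf("B01001A_%03d", c(3:6, 18:21))

# Hypothetical helper: keep only the child age groups and add up their estimates
count_children <- function(df, child_vars) {
  sum(df$estimate[df$variable %in% child_vars])
}

children_white_alone <- count_children(toy_b01001a, child_vars)
children_white_alone
```

The same helper could be applied to each lettered table in turn, swapping the table prefix in `child_vars`.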
The S1701 table provides data on poverty rates, including the number and percentage of individuals and families living below the poverty line. The ACS is the only source for official poverty data down to the community level, making it invaluable as the official poverty measure has direct policy implications, such as the allocation of federal funds to help low income individuals and families at the state and locality levels. The S1701 table provides poverty status in the past 12 months by age, sex, race and ethnicity, educational attainment, employment status, and work experience. The S1702 table provides poverty status in the past 12 months of families. Other tables include B17001 (poverty status in the past 12 months by sex by age), B18131 (age by ratio of income to poverty level in the past 12 months by disability status and type), and B17022 (ratio of income to poverty level in the past 12 months of families by family type by presence of related children under 18 years by age of related children).
While people colloquially refer to the poverty level as a singular concept, the Census Bureau in fact uses a set of poverty thresholds to determine who is in poverty. Forty-eight different poverty income levels are computed; these values are based mostly on family size, with separate numbers for senior citizens in 1-2 person households. Only income before taxes is counted, and non-cash benefits such as public housing, Medicaid, and food stamps are excluded. Poverty is also not calculated for households; rather, the whole household is assigned the poverty status of the person who filled out the form. Poverty is not determined for people living in group quarters such as prisons, college dormitories, military quarters, and nursing homes, or for homeless people unless they are living in shelters. It also excludes children under 15 who are not living with their families, such as foster children and children living with non-relatives or on their own.
Note: same-sex married couples were not treated as married by the Census until 2013, even if legally married. Only the income of the householder and any of their children was included in the poverty calculations until that point.
More information on how poverty is determined by the Census Bureau can be found at How the Census Bureau Measures Poverty6.
Most income data is reported at the household level, but there are tables which focus on family households, as well as some which report on individual income. Income is reported in a number of ways: as a median value, or as a series of medians for groups within a total population; as a number of people or households earning in a certain bracket; and in a few other structures. To calculate median household income, we use the B19001 tables, which provide frequency counts of the number of households whose incomes fall within each of a series of income brackets.
We cannot aggregate medians the same way we do other measures. Instead, we use the following guide to calculate an approximate median from the frequency counts in each income bracket.7
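One common way to approximate a median from bracketed counts is linear interpolation within the bracket that contains the midpoint household. The sketch below assumes that method and may differ in detail from the linked guide. The function name `approx_median` and the household counts are hypothetical, and the bracket bounds are the 16 household income brackets of B19001 as we understand them, so check them against the table metadata before relying on the result.

```r
# Lower bounds (in dollars) of the 16 B19001 household income brackets (assumed)
lower <- c(0, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000,
           50000, 60000, 75000, 100000, 125000, 150000, 200000)
# Upper bounds; the top bracket ("$200,000 or more") is open-ended
upper <- c(lower[-1], Inf)

# Hypothetical helper: linear interpolation inside the median bracket
approx_median <- function(counts, lower, upper) {
  stopifnot(length(counts) == length(lower), length(lower) == length(upper))
  total <- sum(counts)
  if (total == 0) return(NA_real_)            # no households, no median
  cum <- cumsum(counts)
  i <- which(cum >= total / 2)[1]             # bracket holding the midpoint
  if (is.infinite(upper[i])) return(NA_real_) # cannot interpolate an open bracket
  below <- if (i == 1) 0 else cum[i - 1]      # households below that bracket
  lower[i] + (total / 2 - below) / counts[i] * (upper[i] - lower[i])
}

# Made-up household counts per bracket, for illustration only
counts <- c(50, 30, 30, 40, 40, 50, 50, 60, 60, 120, 150, 200, 150, 100, 120, 100)
approx_median(counts, lower, upper)
```

Because the inputs are counts, the bracket counts for several counties can be summed first and the median approximated once for the combined area.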
When considering these tables, it is important to note that each race/ethnicity group is given its own sub-table. The letters A through I provide versions of this table by race and ethnicity; these letter codes follow the same race/ethnicity codes discussed earlier. To obtain the median household income, we have to pull each of the sub-tables for our race/ethnicity groups of interest and then join them together.
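The sub-table names can be generated from the letter codes rather than typed by hand. The sketch below builds the six names for our groups of interest and outlines one possible way to pull and stack them. The helper `pull_income_tables` is hypothetical (it is not part of tidycensus), it is only defined here and not run, and it assumes the tidycensus and dplyr packages are installed and a valid API key is supplied.

```r
# Race/ethnicity letter codes we focus on, with short labels
race_codes <- c(B = "Black", C = "AmerIndian", D = "Asian",
                E = "PacifIslan", H = "White", I = "HispanLatin")

# Build the sub-table names, e.g. "B19001B"
income_tables <- paste0("B19001", names(race_codes))

# Hypothetical helper: pull each sub-table with get_acs() and stack the results.
# Makes one API call per table, so it needs network access and a Census API key.
pull_income_tables <- function(tables, state, county, year, key) {
  pulls <- lapply(tables, function(tbl) {
    out <- tidycensus::get_acs(geography = "county", table = tbl,
                               state = state, county = county,
                               year = year, survey = "acs5", key = key)
    out$table <- tbl  # record which sub-table each row came from
    out
  })
  dplyr::bind_rows(pulls)
}

income_tables
```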
The S2701 table offers insights into the number and percentage of individuals without health insurance coverage. This can be crucial for understanding healthcare access within a community. The American Community Survey added questions about health insurance in 2008. The data shows health insurance coverage broken up by the following categories: age, sex, race, living arrangement, citizenship status, disability status, educational attainment, employment status, and household income. Edits were made to the health care questions in 2009 that make direct comparison with the first year responses impossible without adjustments. In 2012, tables were added to count health insurance coverage for people aged 19 to 25.
The B27001 table specifically focuses on children without health insurance, providing detailed information on their demographic characteristics and uninsured rates. The American Community Survey (ACS) includes questions regarding health insurance status, which were added in 2008 and have been updated in the following years. The B27001 table series provides fine-grained information about general health insurance coverage, broken down into 9 age groups and 9 race/ethnicity groups.
Note: for broader age categories consider table C27001, containing three age groups (child, adult, senior citizen). For other insurance-related information, consider tables B27002 - B27023.
When considering these tables, it is important to note that all data is disaggregated by age and each race/ethnicity group is given its own sub-table. The letters A through I provide versions of this table by race and ethnicity; these letter codes follow the same race/ethnicity codes discussed earlier. To obtain the counts of children without health insurance, we have to pull each of the sub-tables for our race/ethnicity groups of interest and then combine them.
The ACS provides employment data for many geographic levels and localities, covering employment status, retirement, occupation, and industry. The Census Bureau asks a number of questions about whether individuals are employed and, if they are not unemployed or retired, about their occupation and industry. Census Reporter collects these together under the tag ‘employment’. When talking about employment status, the Census Bureau divides the population 16 years and older into two categories: in the labor force or not in the labor force. People who have never worked or who are retired are not in the labor force. People who are not currently working but have worked recently and would like to work are considered in the labor force, but unemployed. People who are actively working are described as either in the civilian labor force or in the armed forces.
Note: The ACS may not be the best source for employment information depending on your needs. Consider the Current Population Survey (CPS), which provides employment statistics in monthly increments and is better suited for longitudinal analysis, or the state-partnered Longitudinal Employer–Household Dynamics (LEHD) program.
The American Community Survey (ACS) gathers extensive data about the housing conditions of respondents, including whether they own or rent their home, how much they spend on housing, and the physical characteristics of homes. The ACS primarily reports housing data in tables with codes beginning with 25. Most of the tables count the number of housing units for a given characteristic. However, a few tables estimate the number of people living in owned or rented housing units. A housing unit is anything from a house to an apartment or even a boat if a person is currently living there.
Homeownership percentage is a crucial indicator of housing stability and economic well-being in a community. The S2502 table in the ACS provides valuable information on homeownership rates within different geographic areas and for different race/ethnicity groups.
Now, we move on to the technical aspect of the walkthrough, examining the tools and packages in R that allow us to pull data from the census and how we can leverage those tools to provide the information we need.
R is a programming language and software environment for statistical computing and graphics. It serves as an integrated suite of software facilities for data manipulation, calculation, and graphical display, including effective data handling, operators for calculations on arrays/matrices, and a collection of intermediate data analysis tools. R is widely used among statisticians and data miners for data analysis and developing statistical software. More information about R and links for installation can be found at their website. 8
RStudio is an integrated development environment (IDE) for R 9. IDEs are developer environments that provide better tools and UI for working in R. It provides a user-friendly interface to work with R, making it easier to write and run R code, view plots and charts, manage files, and more.
Both must be installed if this is the first time you are working with R; there are numerous guides walking you through the process 10, 11. If you already have R/RStudio installed, you can skip this section.
tidycensus Package
tidycensus is an R package designed to help R users get
Census data pre-prepared for use with the tidyverse 12.
It allows users to interface with a select number of the
US Census Bureau’s data APIs and return tidyverse-ready
data frames, optionally with simple feature geometry
included.
To install the package from an R environment, type and run the following command once:
# Only run ONCE (first time using package on local system)
install.packages("tidycensus")
After the first time installing the package, any future scripts that reference the package can have this line of code:
# Run at the start each time working with R script
library(tidycensus)
once at the top, and that will be enough to load the
tidycensus package. You may also want to run
library(tidyverse) at the start to load a common
domain-general collection of R packages for doing data work.
To access ACS data programmatically, you will need to obtain an API key from the Census Bureau. This key allows you to make requests to the Census API and retrieve data directly into your R environment. To gain access to your own API key, access the website here. 13 This site will require an organization name (or personal name if individual), and an email address. You can load the API key into R by running the following code in R:
# Assign the census API key to an object named "census_api"
census_api <- "f1a2b3c4d5e6f7g8h9i0j1k2l3m4n5o6p7q8r9s0t1u2v3w4x5y6z" # NOT REAL KEY
NOTE: Your API key is private information. Take care not to share the string value or make it publicly available.
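One way to keep the key out of scripts is to store it in an environment variable and read it at run time. The sketch below assumes the variable is named `CENSUS_API_KEY`, which is the name tidycensus uses when a key is saved with `census_api_key(..., install = TRUE)`. The helper `get_census_key` is hypothetical and is only defined here, not called.

```r
# One-time setup with tidycensus (run interactively, never commit the real key):
# tidycensus::census_api_key("YOUR_KEY_HERE", install = TRUE)

# Hypothetical helper: read the key from the environment and fail early if missing
get_census_key <- function(var = "CENSUS_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No Census API key found in environment variable ", var)
  }
  key
}
```

A script can then call `census_api <- get_census_key()` instead of containing the key itself, which makes the script safe to share.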
The variables we described earlier were the focus of this project but there may be scenarios where you are interested in pulling your own case-specific variables.
To find table codes in R, we can create a custom function that combines all the metadata we are interested in into a table we can search through. The function is shown here:
# A custom R function that creates a table of all variable codes and metadata
# Requires tidycensus (load_variables) and dplyr (%>% and select)
library(tidycensus)
library(dplyr)

all_acs_meta <- function(year){
  # Gets the list of all variables from all acs5 metadata tables
  vars1 <- load_variables(year, "acs5") %>% select(-geography) # Remove the geography column
  vars2 <- load_variables(year, "acs5/profile")
  vars3 <- load_variables(year, "acs5/subject")
  vars4 <- load_variables(year, "acs5/cprofile")
  # Adds a column recording which dataset each variable came from
  vars1$dataset_table <- "acs5"
  vars2$dataset_table <- "acs5/profile"
  vars3$dataset_table <- "acs5/subject"
  vars4$dataset_table <- "acs5/cprofile"
  # Combine all table rows
  all_vars_meta <- rbind(vars1, vars2, vars3, vars4)
  return(all_vars_meta)
}
As we can see, the variable metadata being loaded in the function
above only applies to the 5-year ACS results (acs5). If you
are working with the 1-year results, the names would have to be changed
(in that case, to acs1). After running the code above, you
can enter the following command and provide a year as shown below:
# Creates a table of all the metadata called "meta_table"
meta_table <- all_acs_meta(year = 2021)
# Opens the newly made table
View(meta_table)
Rather than opening the viewer here, we show the first 6 rows of the table:
| name | label | concept | dataset_table |
|---|---|---|---|
| B01001A_001 | Estimate!!Total: | SEX BY AGE (WHITE ALONE) | acs5 |
| B01001A_002 | Estimate!!Total:!!Male: | SEX BY AGE (WHITE ALONE) | acs5 |
| B01001A_003 | Estimate!!Total:!!Male:!!Under 5 years | SEX BY AGE (WHITE ALONE) | acs5 |
| B01001A_004 | Estimate!!Total:!!Male:!!5 to 9 years | SEX BY AGE (WHITE ALONE) | acs5 |
| B01001A_005 | Estimate!!Total:!!Male:!!10 to 14 years | SEX BY AGE (WHITE ALONE) | acs5 |
| B01001A_006 | Estimate!!Total:!!Male:!!15 to 17 years | SEX BY AGE (WHITE ALONE) | acs5 |
This will allow you to view all the table codes and search for specific ones. Each variable code also comes with a description of what it measures.
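Beyond scrolling through the viewer, the metadata table can be filtered by keyword. The sketch below uses a small stand-in data frame with rows like those shown above so that it runs without an API call. The helper `search_meta` is hypothetical and uses only base R.

```r
# Small stand-in for meta_table, using rows like those shown above
meta_example <- data.frame(
  name = c("B01001A_001", "B01001A_002", "B01001A_003"),
  label = c("Estimate!!Total:", "Estimate!!Total:!!Male:",
            "Estimate!!Total:!!Male:!!Under 5 years"),
  concept = rep("SEX BY AGE (WHITE ALONE)", 3),
  dataset_table = rep("acs5", 3)
)

# Hypothetical helper: case-insensitive keyword search over label and concept
search_meta <- function(meta, pattern) {
  hit <- grepl(pattern, meta$label, ignore.case = TRUE) |
    grepl(pattern, meta$concept, ignore.case = TRUE)
  meta[hit, , drop = FALSE]
}

search_meta(meta_example, "under 5")
```

With the full table, the same call would be `search_meta(meta_table, "poverty")` or any other keyword of interest.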
Once you get ACS data in R, it’s important to validate the data to ensure the numbers are what you intended to pull. An easy way to do this is by accessing the data from Census Data Tables. 14 Search for the specific table code and narrow down to the locality you are looking for using the filter options on the left. Don’t forget to specify the ACS estimate you are looking for.
Clicking on the “American Community Survey” will open up the following, allowing you to select which estimate you are looking for.
Compare the numbers here with the ones in R to ensure they match.
When pulling Census data at the county level using
tidycensus, we can only pull data for individual counties;
the package contains no way to aggregate values across a range of
counties. To do this, we must pull the data for each county
of interest, then manually aggregate the values across the
counties.
Note: This can only reliably be done for count data. Attempting to aggregate other measures (like medians, percentages, etc.) will result in misleading and inaccurate metrics for the combined counties.
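To illustrate why percentages should not be averaged directly, the sketch below compares a naive average of two county percentages with the pooled percentage computed from summed counts. The numbers are the Black or African American counts and population totals reported for the two localities in the DP05 pull in this walkthrough; the object names are hypothetical.

```r
# Counts and totals for one group in two localities (from the DP05 pull)
group_count <- c(albemarle = 9966, charlottesville = 7945)
total_pop   <- c(albemarle = 112513, charlottesville = 46289)

# Misleading: average the two county percentages, ignoring population size
naive_pct <- mean(group_count / total_pop * 100)

# Appropriate for counts: sum numerators and denominators first, then divide
pooled_pct <- sum(group_count) / sum(total_pop) * 100

round(c(naive = naive_pct, pooled = pooled_pct), 2)
```

The naive average overweights the smaller locality, so it comes out higher than the pooled value of about 11.28 percent.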
We begin by determining what our combined area of interest for the ACS data is and where it can be found. In this case, we are interested in results from 2022 for the combined counties of Charlottesville, VA and Albemarle, VA. We define these variables below:
# Census API key
census_api <- Sys.getenv("CENSUS_API_KEY")
# Year for acs5 data pull
year <- 2022
# County FIPS codes and name
county_codes <- c("003", "540") # locality FIPS codes desired
name <- "Charlottesville-Albermarle" # name of locality or combined region
We use the get_acs() function from
tidycensus to pull the example variable, Race and
Ethnicity [DP05]. Here is the code that achieves this
result:
# Create a named vector mapping our labels to ACS variable codes
var_DP05 <- c(
  "RaceEthnic_%_Black" = "DP05_0038", # Black/AA
  "RaceEthnic_%_AmerIndian" = "DP05_0039", # American Indian/Alaska Native
  "RaceEthnic_%_Asian" = "DP05_0044", # Asian
  "RaceEthnic_%_PacifIslan" = "DP05_0052", # Native Hawaiian/Pacific Islander
  "RaceEthnic_%_HispanLatin" = "DP05_0073", # Hispanic/Latino (any race)
  "RaceEthnic_%_White" = "DP05_0079" # White alone (not Hispanic/Latino)
)
# Get ACS data
acs_data_DP05 <- get_acs(geography = "county",
                         state = "VA",
                         county = county_codes,
                         variables = var_DP05,
                         summary_var = "DP05_0033", # total used as the denominator for percents
                         year = year,
                         survey = "acs5",
                         key = census_api
)
We first create a lookup that maps each category label to the appropriate variable code in the ACS tables. For example, the code “DP05_0038” is tied to the count of “Black or African-American alone” race/ethnicity respondents.
From there, we call the get_acs() function. The arguments provided as function inputs are the following:
geography = gives the level of geography; this can be national, state, county, etc.
state = here we specify the particular state we are considering
county = this is provided with a vector (in this case, called county_codes) of the FIPS codes for the counties we want to pull data from
variables = the labeled variable codes we want to pull (in this case, var_DP05)
summary_var = allows us to define the variable code for the summary statistics; by including totals, we can calculate percentages and proportions from the counts for each group
year = sets the time period of the survey we want the values from
survey = defines the type of ACS survey we are interested in; here we want the 5-year ACS values (i.e. “acs5”)
key = is where we provide the Census API key needed to make calls
We can see the table output here:
| GEOID | NAME | variable | estimate | moe | summary_est | summary_moe |
|---|---|---|---|---|---|---|
| 51003 | Albemarle County, Virginia | RaceEthnic_%_Black | 9966 | 419 | 112513 | NA |
| 51003 | Albemarle County, Virginia | RaceEthnic_%_AmerIndian | 125 | 86 | 112513 | NA |
| 51003 | Albemarle County, Virginia | RaceEthnic_%_Asian | 6319 | 267 | 112513 | NA |
| 51003 | Albemarle County, Virginia | RaceEthnic_%_PacifIslan | 34 | 48 | 112513 | NA |
| 51003 | Albemarle County, Virginia | RaceEthnic_%_HispanLatin | 6633 | NA | 112513 | NA |
| 51003 | Albemarle County, Virginia | RaceEthnic_%_White | 85372 | 282 | 112513 | NA |
| 51540 | Charlottesville city, Virginia | RaceEthnic_%_Black | 7945 | 350 | 46289 | NA |
| 51540 | Charlottesville city, Virginia | RaceEthnic_%_AmerIndian | 70 | 42 | 46289 | NA |
| 51540 | Charlottesville city, Virginia | RaceEthnic_%_Asian | 3237 | 201 | 46289 | NA |
| 51540 | Charlottesville city, Virginia | RaceEthnic_%_PacifIslan | 0 | 28 | 46289 | NA |
| 51540 | Charlottesville city, Virginia | RaceEthnic_%_HispanLatin | 2705 | NA | 46289 | NA |
| 51540 | Charlottesville city, Virginia | RaceEthnic_%_White | 30208 | 226 | 46289 | NA |
Now that we have two rows for each race/ethnicity group, one for each
of the two counties we are interested in aggregating, we need to group by
the race/ethnicity categories and summarize the values. This can be done
using the group_by() and summarize() functions from the dplyr package
within tidyverse, along with functions for summarizing ACS data that
come with the tidycensus package. Here is the aggregating code for our
example:
# Create summary variables for the combined counties
# If run for a single locality, this leaves the values unchanged
acs_data_DP05_summarize <- acs_data_DP05 %>%
  group_by(variable) %>%
  summarize(sum_est = sum(estimate),
            sum_moe = moe_sum(moe = moe, estimate = estimate),
            sum_all = sum(summary_est))
# Create percentages from estimates
acs_data_DP05_summarize <- acs_data_DP05_summarize %>%
  mutate(value = round(((sum_est / sum_all) * 100), digits = 2),
         name = name) %>% # 'name' is the region label defined earlier
  select(name, variable, value)
We are grouping by the variable column, which
contains the race/ethnicity breakdown for the variable. This will
aggregate the two counties together for each race/ethnicity. Then, with
the summarize() function we are calculating the sums for
each of the count values, including the estimated counts for each
race/ethnicity category, and the sum of the total counts in each
respective county. For the margin of error (moe), we can utilize a
special tidycensus function called moe_sum(),
which takes two arguments, the moe value and its respective point
estimate.
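For intuition, the margin of error of a sum of estimates is commonly approximated as the square root of the sum of the squared margins of error. The sketch below shows only that basic formula. The helper `rss_moe` is hypothetical and, unlike moe_sum(), does not use the estimates to adjust for zero-valued estimates, so the two can give different results.

```r
# Hypothetical helper: root-sum-of-squares approximation for the MOE of a sum
rss_moe <- function(moe) {
  sqrt(sum(moe^2))
}

# MOEs for the Black or African American estimates in the two localities
rss_moe(c(419, 350))
```

Note that the combined margin of error is smaller than the simple sum of the two margins (419 + 350), because the two sampling errors are treated as independent.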
In the next code chunk, we convert the summed counts into percentages by dividing the race/ethnicity counts by the total counts to get the proportion and multiplying by 100. The name of the combined counties is added as a column for clarity.
Running all this provides the following output:
| name | variable | value |
|---|---|---|
| Charlottesville-Albermarle | RaceEthnic_%_AmerIndian | 0.12 |
| Charlottesville-Albermarle | RaceEthnic_%_Asian | 6.02 |
| Charlottesville-Albermarle | RaceEthnic_%_Black | 11.28 |
| Charlottesville-Albermarle | RaceEthnic_%_HispanLatin | 5.88 |
| Charlottesville-Albermarle | RaceEthnic_%_PacifIslan | 0.02 |
| Charlottesville-Albermarle | RaceEthnic_%_White | 72.78 |
Below are some resources which are helpful for exploring Census ACS data:
censusreporter.org: Helpful way to approach census tables, provides topics/table codes for navigation
data.census.gov: The census site, can search directly for tables using table codes
socialexplorer.com: Helps build tables and maps easily
https://www.census.gov/data/developers/data-sets/acs-1year.html
https://www.census.gov/data/developers/data-sets/acs-5year.html
https://www.census.gov/programs-surveys/acs/guidance/estimates.html
https://www.census.gov/programs-surveys/acs/guidance/estimates.html
https://dof.ca.gov/wp-content/uploads/sites/352/Forecasting/Demographics/Documents/How_to_Recalculate_a_Median.pdf
https://www.stat.colostate.edu/~jah/talks_public_html/isec2020/installRStudio.html